A Small Domain Lower Bound For Parallel Maximum Computation
Abstract
Recent work [BJKTV] has shown that parallel algorithms that are sensitive to the size of the input domain can improve on more general parallel algorithms. The cited paper demonstrates an $O(\log\log\log s)$-step algorithm on an $n$-processor CRCW PRAM for finding the prefix maxima of $n$ numbers in the range $[1..s]$. This paper proves a lower bound demonstrating that no algorithm is asymptotically faster as a function of $s$, by showing that the upper bound is tight for a suitable value of $s$ expressed in terms of $\log n \log\log n$.

Introduction

Few techniques exist to show general lower bounds for parallel computation. One of the most useful has been the application of powerful methods from Ramsey theory. Intuitively, a Ramsey-like theorem states that in some large and possibly complex universe there exists a subuniverse with some simpler or more regular structure. To prove a lower bound on the complexity of a problem, it is often possible to take an arbitrary program, which may exhibit complex behaviour when considered over all inputs, and apply Ramsey theory to show that there exists a subdomain of inputs on which the program behaves in very simple ways. In effect, the program is reduced to operating in a structured fashion, or with a restricted set of operations. Ad hoc techniques can then be used to prove a lower bound on the running time of the program on this subdomain.

In this fashion the following lower bounds have been proved:
- An $\Omega(\sqrt{\log n})$ lower bound on searching in a sorted table of size $n$ with an EREW PRAM [S].
- An $\Omega(\sqrt{\log n})$ lower bound on sorting $n$ items with an $n$-processor Priority CRCW PRAM [MW].
- An $\Omega(\sqrt{\log n})$ lower bound on deciding element distinctness of $n$ items with an $n$-processor Common CRCW PRAM [RSSW]. This was improved in [Bo] to the optimal result $\Omega(\log n / \log\log n)$.
- An optimal $\Omega(\log\log n)$ lower bound on merging two sequences of length $n$ with an $n \log n$-processor Priority CRCW PRAM [BBGSU].
- An $\Omega(\log\log\log n)$ lower bound on simulating an $n$-processor Arbitrary PRAM on an $n$-processor Collision PRAM [GR]. This was improved in [C] to $\Omega(\log\log n)$.

One of the
drawbacks of these uses of Ramsey theory is the fact that, in order to show that the subdomain exists, the domain size must be a very rapidly growing function of $n$. The possibility thus exists that if inputs are taken from the domain $[1..s]$, where $s$ may be polynomial or even singly or doubly exponential in $n$, then algorithms may exist which beat these lower bounds. As an analogy, consider the case of sequential sorting, which has an $\Omega(n \log n)$ lower bound on the RAM model. Radix sort will, for suitably restricted domains, give an $O(n)$ algorithm.

Author's address: Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1. Electronic mail: plragde@maytag.waterloo.edu

The challenge, then, is either to reduce the domain size required in the lower bounds or to produce algorithms with better running times on moderate-sized domains. [BH] improves both the asymptotic result and the domain size for the sorting bound mentioned above by proving an $\Omega(\log n / \log\log n)$ lower bound on computing parity with a Priority CRCW PRAM. This implies the same lower bound for sorting with domain size […]. [E] has obtained the same lower bound as [Bo] for element distinctness, but with a domain size that is doubly exponential in $n$.

When improved algorithms can be found, new impetus is given to the lower bound effort. This was the case when [BJKTV] reported two interesting algorithms for the problems of merging and maximum mentioned above. In the case of merging two sorted lists drawn from the domain $[1..s]$, they give an $O(\log\log\log s)$ algorithm on a CREW PRAM. This is remarkable given that even computing the OR of $n$ bits on a CREW PRAM requires $\Omega(\log n)$ time. In the case of maximum finding, they are able to find the maximum of $n$ numbers from domain $[1..s]$ in time $O(\log\log\log s)$ on a Priority CRCW PRAM; in fact, the prefix maxima can be computed in this time bound.

In this paper we show a value of $s$ for which $\Omega(\log\log\log s)$ time is required to compute the maximum of $n$ numbers on an $n$-processor Priority CRCW PRAM, thus demonstrating that the
domain-sensitive result cannot be improved without further restriction on $s$. This represents a modest beginning to the search for lower bound techniques that work on problems defined over small domains.

The Upper Bound

For completeness, we briefly sketch the domain-sensitive upper bound for finding the maximum, which is claimed but not elaborated upon in [BJKTV]. First we give a fast domain-sensitive algorithm that works with more than $n$ processors, in fact with a number of processors that is also a function of the domain size.

Theorem 1. An $n \log s$-processor CRCW PRAM can find the maximum of $n$ numbers in the domain $[1..s]$ in constant time.

Proof. For ease of presentation, consider $s$ to be a power of 2. The input numbers $x_1, x_2, \ldots, x_n$ are $\log s$ bits long; let $x_i^k$ be the first (high-order) $k$ bits of $x_i$. We label the processors $P_{i,j}$, where $i$ ranges from 1 to $n$ and $j$ from 1 to $\log s$. We label the memory locations $A_\alpha$, one for each string $\alpha$ of bits of length at most $\log s$ (fewer than $2s$ locations). Location $A_\alpha$ will be used to indicate whether there exists an input $x_i$ such that $\alpha$ followed by the bit 1 is $x_i^k$ for $k = |\alpha| + 1$. Finally, we label $n$ locations in memory $B_i$, where $i$ varies between 1 and $n$. Location $B_i$ will be used to indicate whether there is an input with higher value than $x_i$.

In the first step, each processor $P_{i,j}$ reads $x_i$. Then processor $P_{i,j}$ writes 1 to $A_{x_i^{j-1}}$ if the $j$th bit of $x_i$ is 1. This sets the $A$'s as stated above. In the second step, $P_{i,j}$ reads $A_{x_i^{j-1}}$ if the $j$th bit of $x_i$ is 0. There is no input of value greater than $x_i$ if and only if no processor $P_{i,j}$ read 1 at this step, since any greater value will have some common prefix with $x_i$ and then have a 1 where $x_i$ has a 0. Consequently, $B_i$ can be set by having any processor $P_{i,j}$ that read 1 in the second step write the value 1 to $B_i$. The maximum input value $x_i$ is the only value for which $B_i = 0$.

Next we need a fast algorithm that is not domain-sensitive but uses more than $n$ processors.

Theorem 2 [K]. An $n^2$-processor CRCW PRAM can find the maximum of $n$ numbers in $O(1)$ time.

Proof. Let the processors be labelled $P_{i,j}$ for $1 \le i < j \le n$. Processor $P_{i,j}$ reads the
cells containing the $i$th number and the $j$th number and writes 0 over whichever one is smaller. The only number not overwritten by 0 is the maximum.

We use this second algorithm to design a fast algorithm that is not domain-sensitive but uses only $n$ processors.

Theorem 3 [SV]. An $n$-processor CRCW PRAM can find the maximum of $n$ numbers from an unrestricted domain in $O(\log\log n)$ time.

Proof. The algorithm proceeds in phases, starting with phase 0. At the beginning of phase $i$ there are $n / 2^{2^i - 1}$ candidates remaining that could be the maximum. The candidates are divided into $n / 2^{2^{i+1} - 1}$ groups of size $2^{2^i}$. Each group is assigned a number of processors equal to half the square of its size; that is, each group gets $2^{2^{i+1} - 1}$ processors. The maximum of each group is found in constant time using the algorithm of Theorem 2. Each group contributes this one candidate to the next phase; this leaves the claimed number of candidates at the beginning of phase $i + 1$, as required. It is easy to see that $O(\log\log n)$ phases are needed.

Finally, we can describe the domain-sensitive algorithm that uses only $n$ processors.

Theorem 4. An $n$-processor CRCW PRAM can find the maximum of $n$ numbers in the range $[1..s]$ in $O(\log\log\log s)$ time.

Proof. The numbers can be divided into groups of size $\log s$, with $\log s$ processors assigned to each group. The maximum of each group can be found in $O(\log\log\log s)$ time using the algorithm of Theorem 3. This leaves $n / \log s$ candidates for the global maximum, and using $n$ processors and the algorithm of Theorem 1, the maximum can be found in constant time.

Note that the algorithm claimed in [BJKTV] is actually more general than this, as it finds prefix maxima.

The Lower Bound

The lower bound given here follows the general outlines of other PRAM lower bounds [FMW, FRW, GR, RSSW]. The input to a PRAM will be an $n$-tuple of positive integers $(x_1, x_2, \ldots, x_n)$, where $x_i$ is drawn from the domain $[1..s]$ and is initially stored in the local memory of processor $P_i$. Since memory is unbounded, this is equivalent to the situation where the input variables are stored in shared memory, one
to a cell. The output of the PRAM will be in the local memory of processor $P_1$ at time $T$. One step of a PRAM consists of a parallel write followed by a parallel read.

It is useful to slightly modify the Priority PRAM. We disallow overwriting of memory; that is, a cell may be written into only once. To compensate, we allow each processor to simultaneously read $t$ cells at step $t$, providing that those cells, if they were written into at all, were written into at steps $1, 2, \ldots, t$ respectively. One can prove easily (see [FMW]) that for infinite memory this does not decrease the power of the PRAM. This is a technical convenience that makes the proof slightly easier.

Theorem 5. Any Priority CRCW PRAM requires $\Omega(\log\log\log s)$ steps to find the maximum of $n$ numbers in the domain $[1..s]$, for a suitable value of $s$ expressed in terms of $\log n \log\log n$.
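
The upper-bound algorithms sketched in the section on the upper bound can be made concrete with a small sequential simulation. The Python sketch below is illustrative only and is not code from the paper: each loop iteration stands in for one virtual PRAM processor, so it shows the logic of each algorithm and not its parallel running time. The function names are hypothetical, and as a simplifying assumption the inputs are taken to be non-empty lists of integers in the range 0 to 2**bits - 1, where bits plays the role of $\log s$.

```python
def max_all_pairs(values):
    """Pairwise-comparison maximum [K]: one virtual processor per pair."""
    n = len(values)
    loser = [False] * n
    for i in range(n):
        for j in range(i + 1, n):            # virtual processor P(i, j)
            if values[i] < values[j]:
                loser[i] = True
            else:                            # on a tie the later index loses
                loser[j] = True
    # Exactly one position is never marked: the first occurrence of the maximum.
    return next(v for v, lost in zip(values, loser) if not lost)


def max_doubly_log(values):
    """Phased algorithm [SV]: groups of size 2**(2**i) in phase i.

    Returns (maximum, number_of_phases).
    """
    candidates = list(values)
    phase = 0
    while len(candidates) > 1:
        size = 2 ** (2 ** phase)
        candidates = [max_all_pairs(candidates[k:k + size])
                      for k in range(0, len(candidates), size)]
        phase += 1
    return candidates[0], phase


def max_prefix_marking(values, bits):
    """Domain-sensitive maximum with one virtual processor per (input, bit)."""
    marked = set()                           # cells A_alpha that hold 1
    # Step 1: P(i, j) marks the (j-1)-bit prefix of x_i if bit j of x_i is 1.
    for x in values:
        for j in range(1, bits + 1):
            if (x >> (bits - j)) & 1 == 1:
                marked.add((j - 1, x >> (bits - j + 1)))
    # Step 2: x_i is beaten if some 0 bit of x_i follows a marked prefix.
    best = None
    for x in values:
        beaten = any(
            (x >> (bits - j)) & 1 == 0
            and (j - 1, x >> (bits - j + 1)) in marked
            for j in range(1, bits + 1)
        )
        if not beaten:                       # corresponds to B_i = 0
            best = x
    return best


def max_small_domain(values, bits):
    """Combine both: groups of size `bits`, then prefix marking on the winners."""
    group = max(1, bits)
    winners = [max_doubly_log(values[k:k + group])[0]
               for k in range(0, len(values), group)]
    return max_prefix_marking(winners, bits)
```

Prefixes are stored as (length, value) pairs so that bit strings of different lengths with leading zeros stay distinct, which mirrors indexing the cells $A_\alpha$ by the string $\alpha$ itself. Calling `max_small_domain(data, bits)` on any non-empty list `data` of integers below `2**bits` should agree with Python's built-in `max(data)`.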
Publication date: 2002